AITopics | ak 2

ak 2

Theoretical Investigation of Adafactor for Non-Convex Smooth Optimization

Neural Information Processing Systems, Jun-19-2026, 16:26:22 GMT

Adafactor is an early memory-efficient optimization algorithm proposed as an alternative to Adam. By eliminating first-order momentum and employing a rank-1 matrix factorization to approximate the second-moment matrix, Adafactor achieves near-zero memory overhead compared to traditional gradient descent methods. Despite its practical suitability for large-scale training tasks where memory efficiency is critical, its theoretical convergence analysis remains unexplored, largely due to the challenges posed by its matrix factorization and update clipping mechanisms. In this work, we provide a convergence analysis of Adafactor for non-convex smooth optimization. We establish optimal convergence rates (up to logarithmic factors) for finding stationary points in both deterministic and stochastic settings, the latter under sub-Gaussian noise. Central to our analysis are three ingredients: viewing Adafactor as an approximation of Adam, a new proxy step-size that approximates the unique adaptive step-size induced by Adafactor's matrix factorization and update clipping, and an induction argument that controls the gradient magnitude. Our findings suggest, at a theoretical level, that using a rank-1 approximation of the second-moment matrix in Adam does not fundamentally hinder convergence.
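The rank-1 factorization and update clipping mentioned in the abstract can be illustrated with a short NumPy sketch. This is a simplified illustration, not the algorithm analyzed in the paper: the function names, the fixed `beta2`, the constant learning rate, and the `eps` value are assumptions made here, and the relative step-size and time-varying decay of the original Adafactor are omitted.

```python
import numpy as np

def factored_second_moment(r, c):
    """Rank-1 estimate of the second-moment matrix from row and column sums."""
    return np.outer(r, c) / r.sum()

def adafactor_like_step(w, g, r, c, lr=1e-2, beta2=0.999, eps=1e-30, clip=1.0):
    """One simplified Adafactor-style update for an (n, m) weight matrix.

    Only the vectors r (length n) and c (length m) are stored between steps,
    instead of a full (n, m) second-moment matrix as in Adam.
    """
    sq = g * g + eps
    r = beta2 * r + (1.0 - beta2) * sq.sum(axis=1)  # running row sums
    c = beta2 * c + (1.0 - beta2) * sq.sum(axis=0)  # running column sums
    u = g / np.sqrt(factored_second_moment(r, c))   # adaptive, momentum-free direction
    rms = np.sqrt(np.mean(u * u))
    u = u / max(1.0, rms / clip)                    # update clipping: RMS(u) <= clip
    return w - lr * u, r, c

rng = np.random.default_rng(0)
n, m = 4, 3
w = rng.normal(size=(n, m))
g = rng.normal(size=(n, m))
w_new, r, c = adafactor_like_step(w, g, np.zeros(n), np.zeros(m))
```

The optimizer state here has n + m entries rather than n * m. When the squared-gradient matrix is itself rank-1, the factored estimate recovers it exactly; otherwise it is only an approximation.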

adafactor, artificial intelligence, machine learning, (17 more...)

Country: Asia (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

Zhang, Chenyang, Cao, Yuan

arXiv.org Machine Learning, May-8-2026

One widely recognized interpretation of the empirical success of transformers is their ability to perform in-context learning (ICL): pretrained transformers are capable of performing previously unseen tasks based on demonstrations and examples in the prompt, without requiring any additional task-specific fine-tuning (Brown et al., 2020). A line of recent work interprets the ICL capability of transformers from an algorithmic perspective, viewing transformers as models that can implicitly execute certain learning algorithms on the context examples. Specifically, Garg et al. (2022) propose a theoretical framework for ICL in terms of learning a hypothesis class, and empirically show that transformers can in-context learn the linear function class. Motivated by this empirical finding, several recent works attempt to theoretically study how transformers perform in-context learning on linear regression tasks. Akyürek et al. (2022) and Von Oswald et al. (2023) construct multi-layer transformers with linear attention that can execute gradient descent on an "in-context loss" defined on the context data, thereby enabling in-context learning of linear regression.
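The connection between gradient descent on an in-context loss and linear attention can be made concrete with a small NumPy sketch. It assumes the in-context loss is the mean squared error L(w) = (1/2n) Σ (w·x_i − y_i)² with zero initialization; the function names and this particular parameterization are illustrative assumptions, not the constructions used in the cited papers.

```python
import numpy as np

def in_context_gd_predict(X, y, x_query, lr, steps):
    """Predict y for x_query by running gradient descent on the context examples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n  # gradient of the in-context squared loss
        w = w - lr * grad
    return float(w @ x_query)

def linear_attention_readout(X, y, x_query, lr):
    """Softmax-free attention: the query attends to keys x_i with values y_i."""
    n = X.shape[0]
    scores = X @ x_query              # unnormalized key-query inner products
    return float(lr / n * np.sum(y * scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_star = rng.normal(size=3)
y = X @ w_star
x_query = rng.normal(size=3)

one_step = in_context_gd_predict(X, y, x_query, lr=0.1, steps=1)
attention = linear_attention_readout(X, y, x_query, lr=0.1)
many_steps = in_context_gd_predict(X, y, x_query, lr=0.1, steps=2000)
```

Under these assumptions, one gradient step from zero gives the same prediction as the linear attention readout, and running many steps approaches the least-squares prediction on the context.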

ak 1, artificial intelligence, machine learning, (18 more...)

2605.06609

Genre:

Research Report > New Finding (0.64)
Research Report > Experimental Study (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Supplementary Material Proofs from Section 2

Neural Information Processing Systems, Apr-24-2026, 18:30:00 GMT

The proof of Claim 2.3 is obtained via the following calculation, using the definition of the Hermite tensor (Definition 2.2). We will use $i, j$ for indices in $[d]$. The above is equivalent to $H_k(Bx) = B^{\otimes k} H_k(x)$. We construct the truncated distribution $\hat{A}$ as follows. We first sample $x \sim A$, then we reject $x$ unless $\|x\|_2 \le B$. Let $\hat{A}$ be the distribution of the samples we get from this process. Using Markov's inequality and a union bound, we have … Then it only remains to verify the bound on $\left\| \mathbb{E}_{x \sim \hat{A}}[H_k(x)] - \mathbb{E}_{x \sim N_m}[H_k(x)] \right\|_2$ for any $k < d$.
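The transformation rule for Hermite tensors stated in the excerpt can be checked numerically for k = 2. The sketch below rests on assumptions not confirmed by the excerpt: that B is an orthogonal matrix, and that the second-order Hermite tensor is (x xᵀ − I)/√2. The function names are illustrative.

```python
import numpy as np

def hermite2(x):
    """Second-order Hermite tensor under the assumed normalization."""
    d = x.shape[0]
    return (np.outer(x, x) - np.eye(d)) / np.sqrt(2.0)

def apply_tensor_power2(B, T):
    """Apply B (tensor) B to a 2-tensor T: sum_{i,j} B[a,i] B[b,j] T[i,j] = B T B^T."""
    return B @ T @ B.T

rng = np.random.default_rng(0)
d = 4
B, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
x = rng.normal(size=d)

lhs = hermite2(B @ x)                       # H_2(Bx)
rhs = apply_tensor_power2(B, hermite2(x))   # B^{(tensor) 2} H_2(x)
```

The two sides agree because B Bᵀ = I, so the identity term is preserved when B is applied on both sides.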

ak 2, artificial intelligence, lemma 3, (17 more...)

Technology: Information Technology > Artificial Intelligence (0.46)

0d00a699f60e642b310eb04b76cc7731-Supplemental-Conference.pdf

Neural Information Processing Systems, Nov-13-2025, 12:41:12 GMT

ak 2, artificial intelligence, lemma 3, (18 more...)

Technology: Information Technology > Artificial Intelligence (0.46)

SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions

Diakonikolas, Ilias, Kane, Daniel, Ren, Lisheng, Sun, Yuxin

arXiv.org Machine Learning, Mar-7-2024

We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a general methodology to prove SQ lower bounds for this task that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with the standard univariate Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only. Our result naturally generalizes to the setting of a hidden subspace. Leveraging our general SQ lower bound, we obtain near-optimal SQ lower bounds for a range of concrete estimation tasks where existing techniques provide sub-optimal or even vacuous guarantees.
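The hidden-direction distribution described in the abstract can be sketched as a sampler. The choice of a Rademacher (±1) distribution for $A$ is an assumption made here for illustration, not taken from the paper: it matches the first three moments of the standard Gaussian (0, 1, 0) but, being discrete, has no finite chi-squared norm with respect to it. The function names are also illustrative.

```python
import numpy as np

def sample_hidden_direction(n, d, v, sample_a, rng):
    """Draw n points in R^d that follow A along unit vector v and N(0, I) orthogonally."""
    z = rng.normal(size=(n, d))          # standard Gaussian samples
    a = sample_a(n, rng)                 # univariate samples from A
    z_orth = z - np.outer(z @ v, v)      # remove the component along v
    return z_orth + np.outer(a, v)       # insert A along v

def rademacher(n, rng):
    """Uniform on {-1, +1}: mean 0, variance 1, third moment 0."""
    return rng.choice([-1.0, 1.0], size=n)

rng = np.random.default_rng(0)
d = 6
v = rng.normal(size=d)
v = v / np.linalg.norm(v)                # hidden unit direction

X = sample_hidden_direction(1000, d, v, rademacher, rng)
proj = X @ v                             # projection onto the hidden direction
```

Projecting onto the hidden direction recovers the ±1 samples, while any direction orthogonal to `v` sees only Gaussian coordinates. The testing problem is to distinguish such samples from standard Gaussian ones without knowing `v`.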

algorithm, probability, sq lower bound, (15 more...)

2403.04744

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes

Ziyin, Liu, Liu, Kangqiao, Mori, Takashi, Ueda, Masahito

arXiv.org Machine Learning, Feb-10-2021

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance in offering good training efficiency and generalization ability. In this work, we study the minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge. We first derive analytically solvable results for linear regression under various settings, which we compare to the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for SGD to "cause" noise, and that such mismatch may be due to the existence of static noise in the labels or in the input, the use of regularization, or underparametrization. Our results motivate a more accurate general formulation to describe minibatch noise.
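The observation that minibatch sampling does not always cause a fluctuation can be illustrated on linear regression. The sketch below is an assumption-laden illustration, not the paper's derivation: it measures noise as the trace of the per-sample gradient covariance at the least-squares minimizer, and the data sizes, label-noise level of 0.5, and function names are chosen here.

```python
import numpy as np

def per_sample_grads(X, y, w):
    """Gradient of 0.5 * (x.w - y)^2 with respect to w, one row per sample."""
    return (X @ w - y)[:, None] * X

def minibatch_noise_trace(X, y, w):
    """Trace of the per-sample gradient covariance (noise at batch size 1)."""
    g = per_sample_grads(X, y, w)
    centered = g - g.mean(axis=0)
    return float(np.mean(np.sum(centered ** 2, axis=1)))

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)

y_clean = X @ w_true                          # labels the model can fit exactly
y_noisy = y_clean + 0.5 * rng.normal(size=n)  # static label noise

w_clean = np.linalg.lstsq(X, y_clean, rcond=None)[0]
w_noisy = np.linalg.lstsq(X, y_noisy, rcond=None)[0]

noise_clean = minibatch_noise_trace(X, y_clean, w_clean)
noise_noisy = minibatch_noise_trace(X, y_noisy, w_noisy)
```

With noiseless labels every per-sample gradient vanishes at the minimizer, so sampling minibatches produces no fluctuation there. With label noise the residuals are nonzero and the per-sample gradients disagree, which is one of the mismatch conditions listed in the abstract.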

fluctuation, noise, regularization, (15 more...)

2102.05375

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
